[Bugfix]pass TP size to diffusion config by natureofnature · Pull Request #2867 · vllm-project/vllm-omni

natureofnature · 2026-04-17T03:26:21Z


Purpose

Resolves issue #2862, "[CI Failure]: Diffusion X2I(&A&T) · Function Test with H100, test_bagel_expansion.py, openai.InternalServerError: Error code: 500".
When the TP size is greater than 1, the CI still uses a single GPU, which causes a device usage error. This PR makes the test use TP_size GPUs.

Reason:

Fix CLI --tensor-parallel-size being silently dropped by the diffusion engine, causing a shape mismatch error during KV cache transfer when running with TP > 1.

Root cause:

OmniDiffusionConfig.from_kwargs() filters kwargs to only include fields defined directly on OmniDiffusionConfig. Since tensor_parallel_size is a field of the nested DiffusionParallelConfig (not a top-level field), it was silently discarded. This meant the DiT stage always ran with TP=1 regardless of the CLI argument.
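The dropping behaviour described above can be reproduced with a minimal sketch. The classes `ToyParallelConfig` and `ToyDiffusionConfig` are hypothetical stand-ins written for illustration; they are not the actual vllm-omni classes, and the real `from_kwargs()` may differ in detail.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ToyParallelConfig:
    # Hypothetical stand-in for DiffusionParallelConfig.
    tensor_parallel_size: int = 1


@dataclass
class ToyDiffusionConfig:
    # Hypothetical stand-in for OmniDiffusionConfig.
    model: str = ""
    parallel_config: ToyParallelConfig = field(default_factory=ToyParallelConfig)

    @classmethod
    def from_kwargs(cls, **kwargs: Any) -> "ToyDiffusionConfig":
        # Keep only keys that are top-level fields of this class.
        # "tensor_parallel_size" lives on the nested config, so it is filtered out.
        allowed = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in allowed})


cfg = ToyDiffusionConfig.from_kwargs(model="bagel", tensor_parallel_size=2)
dropped_tp = cfg.parallel_config.tensor_parallel_size  # stays at the default of 1
```

No error is raised for the unknown key, which is why the problem only surfaced later as a shape mismatch.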
After PR #2705 introduced _inject_inferred_kv_tp_topology, the KV transfer manager correctly inferred a heterogeneous topology (e.g. from_tp=1, to_tp=2) based on the configured TP sizes. However, the DiT stage was actually running with TP=1 due to the parameter dropping, so it expected full KV heads (e.g. 4) while receiving sliced heads (e.g. 2), resulting in:
shape mismatch: value tensor of shape [47, 2, 128] cannot be broadcast to indexing result of shape [47, 4, 128]
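The two shapes in the error message follow from simple head arithmetic. The sketch below assumes that KV heads are split evenly across TP ranks and takes the token count (47), head dimension (128), and total head count (4) from the example above; `kv_heads_per_rank` is a hypothetical helper, not a vllm-omni function.

```python
def kv_heads_per_rank(total_kv_heads: int, tp_size: int) -> int:
    """Number of KV heads each TP rank holds, assuming an even split."""
    if total_kv_heads % tp_size != 0:
        raise ValueError("KV heads must divide evenly across TP ranks")
    return total_kv_heads // tp_size


total_kv_heads = 4
inferred_to_tp = 2  # receiver TP size inferred from the configured value
actual_dit_tp = 1   # TP size the DiT stage really ran with after the drop

# Shape of the KV slice that was sent versus the buffer the DiT stage allocated.
sent_shape = (47, kv_heads_per_rank(total_kv_heads, inferred_to_tp), 128)
expected_shape = (47, kv_heads_per_rank(total_kv_heads, actual_dit_tp), 128)
shapes_match = sent_shape == expected_shape
```

With the configured and actual TP sizes in agreement, both sides would compute the same per-rank head count and the shapes would match.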

Fix

In OmniDiffusionConfig.from_kwargs(), forward the top-level tensor_parallel_size into parallel_config before field filtering, so CLI arguments propagate correctly to the diffusion engine. If parallel_config already explicitly sets tensor_parallel_size (e.g. from YAML), the existing value is preserved.
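The forwarding rule can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: the class names are hypothetical, and it assumes `parallel_config` arrives either as a dict (for example, parsed from YAML) or is absent.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ToyParallelConfig:
    # Hypothetical stand-in for DiffusionParallelConfig.
    tensor_parallel_size: int = 1


@dataclass
class ToyDiffusionConfig:
    # Hypothetical stand-in for OmniDiffusionConfig.
    model: str = ""
    parallel_config: ToyParallelConfig = field(default_factory=ToyParallelConfig)

    @classmethod
    def from_kwargs(cls, **kwargs: Any) -> "ToyDiffusionConfig":
        kwargs = dict(kwargs)
        tp = kwargs.pop("tensor_parallel_size", None)
        parallel = kwargs.get("parallel_config")
        if isinstance(parallel, dict):
            parallel = dict(parallel)
            if tp is not None:
                # setdefault preserves a value set explicitly in parallel_config.
                parallel.setdefault("tensor_parallel_size", tp)
            kwargs["parallel_config"] = ToyParallelConfig(**parallel)
        elif parallel is None and tp is not None:
            kwargs["parallel_config"] = ToyParallelConfig(tensor_parallel_size=tp)
        # An already-built ToyParallelConfig instance is treated as explicit
        # and left untouched.
        # Forwarding happens before the top-level field filter runs.
        allowed = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in allowed})


from_cli = ToyDiffusionConfig.from_kwargs(tensor_parallel_size=2)
from_yaml = ToyDiffusionConfig.from_kwargs(
    tensor_parallel_size=2, parallel_config={"tensor_parallel_size": 4}
)
```

Here `from_cli` picks up the CLI value, while `from_yaml` keeps the value already set in `parallel_config`, matching the precedence described above.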

Test Plan

In my local environment, I changed the DiT devices to 1,2 because the default YAML settings cause an OOM on my GPU when stages 0 and 1 share the same device. The default settings set the starting device offset of the DiT stage to GPU 0.

CUDA_VISIBLE_DEVICES=0,1,2,3 pytest tests/e2e/online_serving/test_bagel_expansion.py::test_bagel[parallel_tp_2]   --run-level advanced_model   -v -s   2>&1 | tee /tmp/test_bagel_tp2.log

Test Result

@yenuo26 @princepride


chatgpt-codex-connector · 2026-04-17T03:26:30Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

natureofnature · 2026-04-17T04:09:13Z

@codex review

chatgpt-codex-connector · 2026-04-17T04:12:58Z

Codex Review: Didn't find any major issues. Keep it up!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

natureofnature · 2026-04-17T07:55:36Z

@NumberWan PTAL

natureofnature · 2026-04-17T10:14:01Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 596da6a029


Gaohan123 · 2026-04-20T15:57:24Z

Please resolve the conflict and restore the test that was skipped in PR #2883.


yenuo26 · 2026-04-24T06:35:23Z


+# This test uses the default Bagel YAML, and the CLI does not control devices. We modify the YAML file directly.
+_BAGEL_DEFAULT_YAML = str(
+    Path(__file__).resolve().parents[3] / "vllm_omni" / "model_executor" / "stage_configs" / "bagel.yaml"


please use get_deploy_config_path()

There's no bagel yaml in the deploy folder

Gaohan123 · 2026-04-24T08:39:21Z

Please resolve some comments from the bot


Gaohan123 · 2026-04-27T05:01:42Z

Please fix UT CI failure. Thanks


natureofnature · 2026-04-27T08:45:04Z

@hsliuustc0106 PTAL

natureofnature · 2026-04-27T08:45:17Z

Please fix UT CI failure. Thanks

Fixed @Gaohan123


Gaohan123

It seems that I don't see which skipped test that you recover.

natureofnature · 2026-04-28T01:45:40Z

It seems that I don't see which skipped test that you recover.

@Gaohan123

Gaohan123 · 2026-04-28T06:37:21Z

It seems that I don't see which skipped test that you recover.

@Gaohan123

Sorry, I missed that.


natureofnature requested a review from hsliuustc0106 as a code owner April 17, 2026 03:26

yenuo26 added the diffusion-x2iat-test label (triggers the Buildkite x2image + x2audio + x2text diffusion model tests in nightly CI) Apr 17, 2026

natureofnature force-pushed the bugfix/cli_diffusion_args branch from e76b07a to f3f86b5 Compare April 17, 2026 03:30

chatgpt-codex-connector Bot reviewed Apr 17, 2026


Comment thread tests/e2e/online_serving/test_bagel_expansion.py Outdated

Comment thread tests/e2e/online_serving/test_bagel_expansion.py Outdated

natureofnature mentioned this pull request Apr 20, 2026

[RFC]: Bagel Performance Optimization - CFG/TP Mooncake TE Support JiusiServe/vllm-omni#207

Closed

1 task

Gaohan123 added this to the v0.20.0 milestone Apr 20, 2026

natureofnature added 2 commits April 24, 2026 09:56

pass tp size to diffusion config

2daa573

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

update bagel CI to use real tp devices

a41e8f0

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

natureofnature force-pushed the bugfix/cli_diffusion_args branch from 596da6a to a41e8f0 Compare April 24, 2026 02:00

yenuo26 reviewed Apr 24, 2026


yenuo26 added the ready label (triggers Buildkite CI) and removed the diffusion-x2iat-test label (triggers the Buildkite x2image + x2audio + x2text diffusion model tests in nightly CI) Apr 24, 2026

adjust codes for comments

26352d1

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

update for simple test

029109f

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

yenuo26 added the diffusion-x2iat-test label (triggers the Buildkite x2image + x2audio + x2text diffusion model tests in nightly CI) and removed the ready label (triggers Buildkite CI) Apr 27, 2026

update

c7b0c69

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

Gaohan123 reviewed Apr 27, 2026


yenuo26 linked an issue Apr 28, 2026 that may be closed by this pull request

[CI Failure]: Diffusion X2I(&A&T) · Function Test with H100, test_bagel_expansion.py, openai.InternalServerError: Error code: 500 #2862

Closed

1 task

natureofnature added 2 commits April 29, 2026 09:55

Merge branch 'main' into bugfix/cli_diffusion_args

048e9e5

use deployment yaml

30882e7

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

NumberWan mentioned this pull request Apr 29, 2026

[Perf] Bagel Performance Nightly CI test #2175

Merged

natureofnature added 5 commits April 29, 2026 06:15

fix device name

f99b0ed

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

set a default tp value to avoid all stages share the same tp size

f8766a0

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

update

250a2fe

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

update

27e7f75

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

update

04b54c2

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

yenuo26 added the ready label (triggers Buildkite CI) and removed the diffusion-x2iat-test label (triggers the Buildkite x2image + x2audio + x2text diffusion model tests in nightly CI) Apr 30, 2026

update

53f4522

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

natureofnature force-pushed the bugfix/cli_diffusion_args branch from f94cd0a to 53f4522 Compare April 30, 2026 03:22

Merge branch 'main' into bugfix/cli_diffusion_args

62cad39

hsliuustc0106 merged commit bfb730e into vllm-project:main Apr 30, 2026
7 of 8 checks passed


4 participants

natureofnature commented Apr 17, 2026 •

edited

Loading

natureofnature commented Apr 17, 2026 •

edited

Loading

Gaohan123 left a comment •

edited

Loading